trainer: Flux policy user-guide by vsoch · Pull Request #4283 · kubeflow/website

vsoch · 2026-01-19T02:18:43Z

Description of Changes

This changeset adds documentation (a user guide) for using the Flux Policy in the Kubeflow Trainer. The example runs a popular simulation, LAMMPS, on CPU. A GPU example is desired and will likely be added in a separate changeset.

Related Issues

This is linked to a pull request in the trainer repository:

Related: kubeflow/trainer#3064.

I did not open an issue here (and can if needed, please let me know).

Checklist

You have signed off your commits
Ensure you follow best practices from our contributing guide.
(for big changes) I will post screenshots of the changes in a PR comment

cc @milroy

This changeset adds documentation (a user guide) for using the Flux Policy in the Kubeflow Trainer. The example runs a popular simulation, LAMMPS, on CPU. A GPU example is desired and will likely be added in a separate changeset. Signed-off-by: vsoch <vsoch@users.noreply.github.com>

google-oss-prow · 2026-01-19T02:18:53Z

Hi @vsoch. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

github-actions · 2026-01-19T02:19:04Z

🚫 This command cannot be processed. Only organization members or owners can use the commands.

Arhell

/ok-to-test

andreyvelich · 2026-03-10T10:03:55Z

@vsoch Can you update this guide for Flux in Trainer please?

vsoch · 2026-03-10T19:56:55Z

Sure thing - I'll bring up a cluster on AWS today and test out using the (now merged) main branch to run it.

google-oss-prow · 2026-03-11T01:39:23Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andreyvelich. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

content/en/docs/components/trainer/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

vsoch · 2026-03-11T01:40:23Z

@andreyvelich I've updated the demo after testing on AWS, and it now uses assets from the trainer repository directly. Do you want me to include an example that uses AWS EFA? Apologies because I think I asked this before, but does the Kubeflow Trainer support one-off resources like EFA?

vsoch · 2026-03-11T01:50:58Z

Found it! Putting here so I remember next time. 🙃

$ kubectl explain trainjob.spec.trainer.resourcesPerNode
GROUP:      trainer.kubeflow.org
KIND:       TrainJob
VERSION:    v1alpha1

FIELD: resourcesPerNode <Object>


DESCRIPTION:
    resourcesPerNode defines the compute resources for each training node.
    
FIELDS:
  claims	<[]Object>
    Claims lists the names of resources, defined in spec.resourceClaims,
    that are used by this container.
    
    This field depends on the
    DynamicResourceAllocation feature gate.
    
    This field is immutable. It can only be set for containers.

  limits	<map[string]Object>
    Limits describes the maximum amount of compute resources allowed.
    More info:
    https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/

  requests	<map[string]Object>
    Requests describes the minimum amount of compute resources required.
    If Requests is omitted for a container, it defaults to Limits if that is
    explicitly specified,
    otherwise to an implementation-defined value. Requests cannot exceed Limits.
    More info:
    https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
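
For reference, here is a minimal sketch of how `resourcesPerNode` might be filled in to request an EFA device per node. It is a sketch under assumptions, not a tested manifest: the TrainJob name `lammps-flux-efa` and the runtime name `flux-runtime` are hypothetical placeholders, the resource name `vpc.amazonaws.com/efa` assumes the AWS EFA device plugin is running on the cluster, and the `runtimeRef` and `numNodes` fields are not part of the `kubectl explain` output above. The script only builds the manifest and prints it as JSON; it does not contact a cluster.

```python
import json


def build_trainjob(name, runtime, num_nodes, resources):
    """Return a TrainJob manifest as a dict.

    Requests and limits are set to the same values, because Kubernetes
    requires requests to equal limits for extended resources.
    """
    return {
        "apiVersion": "trainer.kubeflow.org/v1alpha1",
        "kind": "TrainJob",
        "metadata": {"name": name},
        "spec": {
            # Hypothetical runtime name; use whichever runtime the cluster provides.
            "runtimeRef": {"name": runtime},
            "trainer": {
                "numNodes": num_nodes,
                "resourcesPerNode": {
                    "requests": dict(resources),
                    "limits": dict(resources),
                },
            },
        },
    }


# Assumption: "vpc.amazonaws.com/efa" is the resource advertised by the
# AWS EFA device plugin; the job and runtime names are placeholders.
manifest = build_trainjob(
    name="lammps-flux-efa",
    runtime="flux-runtime",
    num_nodes=2,
    resources={"cpu": "4", "vpc.amazonaws.com/efa": "1"},
)

print(json.dumps(manifest, indent=2))
```

Since `kubectl` accepts JSON manifests, the printed output could be piped to `kubectl apply -f -` once the placeholder names are replaced with real ones.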

vsoch · 2026-03-11T01:52:48Z

@andreyvelich FYI, we are going to be showing off Kubeflow Trainer at the High Performance Software Foundation (HPSF) meeting next week!

https://events.linuxfoundation.org/hpsf-conference/program/schedule/

The best part of that abstract might be the title :)

https://www.youtube.com/watch?v=hdcTmpvDO0I

andreyvelich · 2026-03-11T14:31:46Z

> @andreyvelich I've updated the demo after testing on AWS, and it now uses assets from the trainer repository directly. Do you want me to include an example that uses AWS EFA? Apologies because I think I asked this before, but does the Kubeflow Trainer support one-off resources like EFA?

Sure, you can add this to the Flux examples (https://github.com/kubeflow/trainer/tree/master/examples/flux) or to the Trainer documentation.

> @andreyvelich FYI, we are going to be showing off Kubeflow Trainer at the High Performance Software Foundation (HPSF) meeting next week!

This is awesome! We should definitely promote it through our outreach channels. cc: @kubeflow/kubeflow-outreach-committee @kubeflow/kubeflow-steering-committee @kubeflow/kubeflow-trainer-team.

@tarekabouzeid @yashpal2104, could you help share this on Kubeflow’s social channels? Highlighting that Kubeflow Trainer is being used for HPC workloads would be especially impactful and highly relevant for the AI community.

google-oss-prow bot added the needs-ok-to-test label Jan 19, 2026

google-oss-prow bot requested review from Jeffwan and andreyvelich January 19, 2026 02:18

google-oss-prow bot added area/trainer AREA: Kubeflow Trainer / Kubeflow Training Operator size/L labels Jan 19, 2026

vsoch mentioned this pull request Jan 19, 2026

feat: support for Flux Framework as HPC manager kubeflow/trainer#3064

Closed

1 task

Arhell reviewed Jan 20, 2026

View reviewed changes

google-oss-prow bot added ok-to-test and removed needs-ok-to-test labels Jan 20, 2026

juliusvonkohout assigned andreyvelich Feb 1, 2026

flux: update to use trainer repository assets

97588d2

Signed-off-by: vsoch <vsoch@users.noreply.github.com>

vsoch wants to merge 2 commits into kubeflow:master from vsoch:user-guide/flux
